Capturing the Moment: Strategies for Selection and Collection of Web-Based Resources to Document Important Social Phenomena

Author

  • Christopher A. Lee
Abstract

The VidArch project is capturing YouTube videos and web pages associated with the 2008 U.S. presidential election. We are also exploring strategies and building tools for curators of digital collections to appraise and describe such materials. Blogs are an increasingly important source for documenting online deliberations. Blogs can provide commentary, but they can also serve as “contextual information bridges” for identifying and capturing resources to which the pages link. Web archiving literature usually defines collecting in terms of setting up a set of seeds for crawls based on specific URLs. However, a substantial portion of material on the Web is accessible through posing queries. Curators of digital collections will need tools and methods for combining information from queries and crawls to identify and collect materials. The VidArch project is developing and testing such approaches, in order to support what Hans Booms would call a “documentation plan” for reflecting the heterogeneous and interlinked conversation space surrounding contemporary events.

Introduction

The Web has become a vital forum of deliberation around issues of societal importance. The 2008 election for president of the United States, for example, is likely to be strongly influenced by materials posted to, shared and discussed on the Web. According to a December 2007 Pew survey, 24% of Americans report regularly learning about the campaign from the Internet, up from 13% in 2004 and 9% in 2000 [1]. According to the same survey, 24% of Americans report having seen something about the campaign in an online video. In order to make sense of the 2008 electoral process, future researchers would benefit not only from perpetual access to Web materials but also from contextual information that allows them to make meaningful use and sense of the materials.
The VidArch project [2] is capturing YouTube videos and web pages associated with the election, as well as exploring strategies and building tools for curators of digital collections to appraise and describe such materials.

Importance of Documenting Online Deliberation Spaces

YouTube allows for widespread dissemination of videos. According to a December 2007 Pew survey, 48% of American Internet users reported having “watch[ed] a video on a videosharing site like YouTube or GoogleVideo,” while 14% reported posting videos online that they had recorded [3]. These numbers are much higher among American “poli-fluentials,” who are expected to be most active in the 2008 election [4]. This widespread use provides new opportunities for relatively open discourse, while also challenging the control of traditional authorities over predominant messages. In the 2006 U.S. elections, YouTube and MySpace “weaken[ed] the level of control that campaigns have over the candidate’s image and message since anybody, both supporters and opponents, can post a video and/or create a page on behalf of the candidates…” [5]. YouTube is playing an increasingly important role in political discourse and may have a significant impact on voting behavior [6]. All major candidates for the U.S. presidential election created YouTube channels. YouTube and CNN jointly sponsored presidential debates, featuring video questions uploaded by YouTube users. Several candidates also posted videos to YouTube in which they posed specific questions and asked users to post video replies. Perhaps even more importantly, events that would previously have had only a very local impact can now attain widespread visibility and impact because they are posted to YouTube.

An even larger set of web sites provides links to and commentary about the content in YouTube. We use the term “blogosphere” to refer to the distributed and inter-linked body of blog (web log) pages.
Blogs are based on software that allows for relatively easy addition of small entries to pages over time. As with YouTube, the blogosphere is a popular and influential space for political discourse. Like YouTube, the blogosphere also provides space for extended discussion, speculation and agenda-pushing that might not happen in traditional media venues. Those who report daily use of political blogs are more likely to be at the ends of the political spectrum, and their political blog reading is strongly motivated by an interest in “news the mass media ignore” and a “different perspective on the news” [7]. Blog pages are more likely than other Web pages to provide out-links to “hubs,” often as a result of bloggers copying material out of “news items from key blog hubs and adding their own comments to them; in most cases this is done to let friends within the local peer network know what is interesting in the wider Web, while giving credit to the source” [8]. The above discussion suggests that, if a repository had the goal of documenting political deliberation surrounding the election, it would be well served by including in its collecting scope not only “official” materials from the campaigns and mainstream media, but also content from these popular online interaction spaces, especially when repositories intend to serve as “curators of the experience as well as the record” [9].

Appraisal of Web Materials

A fundamental challenge for curators of digital collections is appraisal, i.e. determining what segments of the documentary universe should be obtained and preserved. In a Web environment, appraisal can inform rules for crawls (sources, access points, filtering rules, and relevance criteria). Appraisal should be guided by notions of what one ultimately is trying to document. Documenting a contemporary phenomenon often requires cutting across numerous institutions and media [10].
In VidArch, we are addressing what we see as a gap between the literature on web archiving and established conceptions of archival appraisal. Twenty years ago, the translation of writing by Hans Booms introduced a new perspective to North American archival thought: appraisal should be based on the best (i.e. most informed by empirical evidence) judgments of the “value ascribed by those contemporary to the material,” i.e. what members of society judged most valuable or important at the time documents were created [11]. If one accepts this approach, then a natural next question is how best to reflect the emphasis that people were placing on issues or materials at a given time. There is no single monolithic set of values or perceptions of “society,” but one can use various data sources to determine what is most influential, viewed, discussed, and cited. Two assumptions underlie the work described in this paper: (1) online deliberation surrounding the U.S. presidential election process is important to document; and (2) YouTube videos are playing a prominent role in the deliberation process, which warrants the preservation and contextualization of a subset of the videos.

Enacting Appraisal Criteria through Crawling

Web archiving tools and techniques have matured dramatically in recent years, and numerous institutions have taken on web archiving initiatives [12] [13]. Web capture has usually been based on identifying a set of seed uniform resource locators (URLs) and then recursively following links within a specified set of constraints (e.g. number of hops, specific domains). The “Arizona Model” is an important attempt to operationalize the archival principles of provenance and original order by mapping web crawling criteria to the hierarchical structure of sites [14]. Web archiving based on recursive link following, however, faces two major challenges. First, link paths often do not map cleanly or directly to long-standing criteria for appraisal and collection development, e.g.
topics, provenance, genres, dates. Second, a substantial portion of material on the Web is accessible through posing queries to databases, rather than following links. Several projects have demonstrated methods for scoping a topic-based crawl, based on automated analysis of the content of pages [15]. There have also been efforts to automatically populate web entry forms and collect pages that cannot be reached through link-following [13][16][17][18]. There has been relatively little investigation of combining link-following and queries to select complementary sets of resources. Curators of digital collections will need tools and methods for combining information from both queries and crawls to identify and collect Web materials that document and contextualize phenomena. VidArch is developing and testing such approaches, in order to support what Booms would call a “documentation plan” for reflecting the heterogeneous and interlinked conversation space surrounding contemporary events.

VidArch Approach

The VidArch team has used the YouTube application program interfaces (APIs) to collect videos related to the 2008 U.S. presidential election, along with associated comments and other metadata, based on 57 queries to YouTube every day (except for days of maintenance), since May 2007 [19]. The queries include 50 names of individual candidates and 6 queries related to the election in general (e.g. “election 2008”). We use the term “crawl” to indicate one instance of executing the following two sets of activities: 1) submitting all 57 queries to YouTube and collecting data from the top 100 results of each query based on YouTube’s relevance ranking; and 2) collecting updated dynamic metadata for each video that has been “discovered” through an instance of step 1.
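The two-step crawl defined above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the VidArch implementation: `run_crawl`, `search`, `fetch_metadata` and the `discovered` dictionary are hypothetical names, and the two callables stand in for whatever YouTube API requests the project actually issued.

```python
from typing import Callable, Dict, Iterable, List


def run_crawl(
    queries: Iterable[str],
    search: Callable[[str, int], List[str]],   # hypothetical: query, limit -> ranked video IDs
    fetch_metadata: Callable[[str], dict],     # hypothetical: video ID -> current dynamic metadata
    discovered: Dict[str, List[dict]],         # video ID -> one metadata snapshot per crawl
    top_n: int = 100,
) -> Dict[str, List[dict]]:
    """Run one crawl: a query step followed by a metadata refresh step."""
    # Step 1: submit every query and register the top-ranked results.
    for query in queries:
        for video_id in search(query, top_n)[:top_n]:
            discovered.setdefault(video_id, [])

    # Step 2: refresh dynamic metadata for every video discovered so far,
    # including videos that surfaced only in earlier crawls.
    for video_id, snapshots in discovered.items():
        snapshots.append(fetch_metadata(video_id))
    return discovered
```

Keeping `discovered` across daily runs is the point of the design: a video that drops out of the top 100 continues to accumulate metadata snapshots, so its later popularity can still be observed.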
When building long-term digital collections, it is essential not only to ensure continuing access to “target digital objects” but also to create, capture and manage contextual information to allow future users to understand, make sense of, analyze and use the target digital objects [20]. We are using data from YouTube and elsewhere on the Web (blogs, in-links identified by Web search engines, Web traffic data) to inform the appraisal of the YouTube videos and collect further contextual information associated with the videos. Collecting data from various sources allows us to identify likely strengths, gaps and complementarities.

Collecting and Analyzing Blog Pages

On June 6, 2007, Fred Stutzman began a systematic collection of links from blog postings related to the 2008 U.S. presidential election. Queries related to 15 of the candidates were submitted through both Google Blogsearch and Technorati. For VidArch, a subset of blog postings was captured that either (1) included the name of a presidential candidate in their content or (2) linked to a candidate’s web site. The queries were run three times per hour, for a total of 72 queries per term each day. Once a given query set was retrieved, a web crawler created a “profile” for each blog page, reflecting its out-links. Within the resulting data set, an “out-link” is any link from the blog page to a resource outside that page; this includes links to other postings within a blog, navigational links, and links to ads or related postings.

Contextual Information Bridges

As noted above, blog entries are often relatively short snippets of text that then provide links to other resources where readers can discover more information [8]. Drezner and Farrell argue that blogs “are important less because of their direct effects on politics than their indirect ones: they influence important actors within mainstream media who in turn frame issues for a wider public” [21].
McKenna and Pole report, “By far the most popular activity for all political bloggers is providing readers with links to reports and articles found elsewhere.” [22] We are exploring the strengths and limitations of blog pages as “contextual information bridges” (CIBs), allowing curators to mine the postings to identify and then capture other online resources to which the pages link (i.e. those who linked to a given YouTube video also tended to link to some other explanatory source such as a newspaper article). A motivating example for considering blog pages as contextual information bridges is “Edwards Places Campaign Headquarters in NC” [23], which was produced and posted to YouTube by Carla Babb, a graduate student in journalism at the University of North Carolina. The video points out the irony of the campaign headquarters for John Edwards being located in an affluent neighborhood, given the Edwards campaign’s strong focus on alleviating poverty. The video drew controversy and media attention when the Edwards campaign allegedly demanded it be taken down from YouTube. It never appeared in the top 100 results for a “John Edwards” query to YouTube, which is an important reminder that videos that are influential and popular within YouTube might be missed by simple queries based solely on relevance rank. YouTube did list this video as #8 among the most viewed videos in “News & Politics” for the week of 2007-10-30. We conducted a query within Google Blog Search for blog pages that link to this video. In addition to any information that the blog pages themselves provide about the video (which varies from a simple link with no further explanation, to a fairly detailed 319-word explanation of the controversy surrounding the video), we noted that many of the blog pages also linked to other online sources that provided further contextual information.
See Figure 1 for an illustration of blog pages that link to this video serving as bridges to contextual information in other online sources.

Obama Collection

VidArch has analyzed relationships between YouTube and blog data, including overlap, consistency and relative relevance to the intended collecting scope, which are reported elsewhere [23]. In order to further investigate the potential role of blogs as sources of contextual information for online videos, we have more closely analyzed the blog and YouTube data related to Barack Obama. The blog crawler collected data for 136,687 blog pages in response to the Obama query. Those pages contain 1,468,533 out-links, for an average of 10.74 out-links per page. Of those out-links, 10,285 (0.7%) are to YouTube videos, of which there are 6,903 unique videos. There are 4,135 pages (3% of all blog pages generated from the Obama crawl) that represent potential contextual information bridges, because they contain out-links to at least one YouTube video. The 4,135 pages contain a total of 170,990 out-links. Table 1 lists the 20 YouTube videos that received the most in-links from the blog pages, 10 of which are also in our collection based on crawling YouTube directly. Out-links from the crawls of blog pages are a less precise indicator of a video being “about Barack Obama” than is YouTube’s relevance ranking, which is consistent with our earlier findings across several of the candidates [25]. Of the videos in Table 1, four are focused on candidates other than Obama, one is about a politician who was not a presidential candidate, and two are not about election campaigns at all; none of these appeared in our collection resulting from crawls of YouTube.
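The counts reported above (out-links per page, links to YouTube, unique videos, candidate bridge pages, and most-linked videos) could be derived from blog page profiles along the following lines. This is a sketch under stated assumptions rather than the crawler the project used: the function names are hypothetical, a profile is assumed to be a plain list of out-link URLs per page, and YouTube links are assumed to take the `youtube.com/watch?v=<id>` form.

```python
from collections import Counter
from html.parser import HTMLParser
from typing import Dict, List, Optional
from urllib.parse import parse_qs, urljoin, urlparse


class _LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def out_links(page_url: str, html: str) -> List[str]:
    """Return absolute http(s) links that point outside the page itself."""
    parser = _LinkParser()
    parser.feed(html)
    base = page_url.split("#")[0]
    links = [urljoin(page_url, href) for href in parser.hrefs]
    return [
        url for url in links
        if urlparse(url).scheme in ("http", "https") and url.split("#")[0] != base
    ]


def youtube_id(url: str) -> Optional[str]:
    """Extract the video ID from a youtube.com/watch?v=... URL (assumed form)."""
    parts = urlparse(url)
    host = parts.hostname or ""
    if (host == "youtube.com" or host.endswith(".youtube.com")) and parts.path == "/watch":
        return parse_qs(parts.query).get("v", [None])[0]
    return None


def summarize(profiles: Dict[str, List[str]], top_n: int = 20) -> dict:
    """Compute collection-level counts from {page_url: [out-link, ...]}."""
    in_links: Counter = Counter()
    bridge_pages: List[str] = []
    for page, links in profiles.items():
        ids = [vid for vid in map(youtube_id, links) if vid]
        in_links.update(ids)
        if ids:
            # A page with at least one YouTube out-link is a candidate bridge.
            bridge_pages.append(page)
    return {
        "pages": len(profiles),
        "out_links": sum(len(links) for links in profiles.values()),
        "youtube_links": sum(in_links.values()),
        "unique_videos": len(in_links),
        "bridge_pages": bridge_pages,
        "top_videos": in_links.most_common(top_n),
    }
```

For a bridge page, the out-links that are not YouTube links are the candidate contextual resources (for instance, a newspaper article about the video); a curator would still need to review them, since out-links by this definition also include navigation and advertising.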
If one’s primary task is identifying a very focused set of YouTube videos to serve as the target digital objects to collect on a given topic, it appears that direct crawls of YouTube may be more effective than relying on links from crawled web pages, particularly if the topic can easily be translated into a simple query (e.g. name of a candidate). If, however, one has already identified a set of target YouTube videos to collect, and would like to identify further online resources that can provide contextual information associated with those videos, the blog pages could be much more valuable. We identified many cases of blog pages providing useful contextual information about a video either in the content of the blog entry itself or through a link to an informative textual, audio or video source.

Figure 1. Example of Blogs as Contextual Information Bridges

Table 1. Top 20 YouTube Videos Linked from “Barack Obama” Blog Pages (* = also in YouTube Crawl Set)

Title | YouTube ID | Links

Publication date: 2008